[SPARK-32428] [EXAMPLES] Make BinaryClassificationMetricsExample cons…#29222
Closed
titsuki wants to merge 2 commits into apache:master from
Conversation
…istently print the metrics on driver's stdout
Contributor
The change looks reasonable to me. I checked the examples: most of them use RDD.collect().foreach, but a few use RDD.foreach. I think we probably also want to change those, so that all the examples print their results on the driver's stdout.
Contributor
cc @srowen
srowen
reviewed
Jul 25, 2020
Member
srowen
left a comment
Yep let's fix all such occurrences if possible. Thanks!
How I did it:
+ 1. Replace all occurrences of `foreach` with `collect.foreach`:
```
$ find examples/src/ -type f | xargs grep foreach | grep -v foreachRDD | grep -P -v "(collect.foreach|collect\(\).foreach)" | awk '{ print $1 }' | sed -e 's/:$//' | uniq | grep scala | xargs sed -i -e 's/foreach/collect.foreach/g'
```
+ 2. For each file, check whether the modification is correct by running `mvn compile`, and call `git checkout --` on the file if it is not:
```
$ mvn compile | grep Error | awk '{ print $3 }' | perl -plne 's/:(\d+):$//' | xargs -i git checkout -- {}
```
+ 3. Manually call `git checkout --` if the modification seems superfluous:
We excluded AccumulatorMetricsTest.scala and ExceptionHandlingTest.scala from the targets.
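A note on step 1: the final `sed -i -e 's/foreach/collect.foreach/g'` runs over every line of each selected file, so a file that mixes a bare `foreach` with `foreachRDD` or an existing `collect().foreach` gets those lines rewritten too, and step 2 only reverts files that fail to compile. The sketch below is a line-scoped variation, not the command used for this PR: it skips lines that already collect or that use `foreachRDD`, and it prints to stdout instead of editing in place. The helper names `add_collect` and `error_path`, the scratch file, and the sample compiler error line are all made up for illustration.

```shell
# Print a rewritten copy of one source file on stdout.
# Lines that use foreachRDD, or that already call collect before foreach,
# are left alone; every other ".foreach" becomes ".collect.foreach".
add_collect() {
  sed -E '/foreachRDD|collect(\(\))?\.foreach/!s/\.foreach/.collect.foreach/g' "$1"
}

# Take the third whitespace-separated field of a compiler error line and
# strip the trailing ":<line number>:" so that only the file path remains.
error_path() {
  printf '%s\n' "$1" | awk '{ print $3 }' | sed -E 's/:[0-9]+:$//'
}

# Scratch input; the contents are hypothetical, not taken from the Spark examples.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
precision.collect().foreach(println)
recall.foreach(println)
stream.foreachRDD(handle)
EOF

# Only the second line should change.
add_collect "$sample"

# Hypothetical error line in the shape the step 2 pipeline assumes.
error_path '[ERROR] [Error] /tmp/Example.scala:42: type mismatch'

rm -f "$sample"
```

Because `add_collect` writes to stdout, the result can be inspected with `diff` against the original file before anything is overwritten.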
Contributor
Author
@huaxingao @srowen
Member
+1
Member
Jenkins test this please
Test build #126548 has finished for PR 29222 at commit
huaxingao
approved these changes
Jul 25, 2020
srowen
pushed a commit
that referenced
this pull request
Jul 26, 2020
…istently print the metrics on driver's stdout

### What changes were proposed in this pull request?

Call collect on RDD before calling foreach so that it sends the result to the driver node and print it on this node's stdout.

### Why are the changes needed?

Some RDDs in this example (e.g., precision, recall) call println without calling collect. If the job is under local mode, it sends the data to the driver node and prints the metrics on the driver's stdout. However if the job is under cluster mode, the job prints the metrics on the executor's stdout. It seems inconsistent compared to the other metrics nothing to do with RDD (e.g., auPRC, auROC) since these metrics always output the result on the driver's stdout. All of the metrics should output its result on the driver's stdout.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This is example code. It doesn't have any tests.

Closes #29222 from titsuki/SPARK-32428.

Authored-by: Itsuki Toyota <titsuki@cpan.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 86ead04)
Signed-off-by: Sean Owen <srowen@gmail.com>
Member
Merged to master/3.0/2.4
…istently print the metrics on driver's stdout
What changes were proposed in this pull request?
Call collect on the RDD before calling foreach, so that the results are sent to the driver node and printed on that node's stdout.
Why are the changes needed?
For some RDDs in this example (e.g., precision, recall), println is called inside foreach without calling collect first.
If the job runs in local mode, the data is sent to the driver node and the metrics are printed on the driver's stdout.
However, if the job runs in cluster mode, the metrics are printed on the executors' stdout.
This is inconsistent with the metrics that are not RDDs (e.g., auPRC, auROC), since those are always printed on the driver's stdout.
All of the metrics should print their results on the driver's stdout.
Does this PR introduce any user-facing change?
No
How was this patch tested?
This is example code. It doesn't have any tests.